In this project we have chosen to work with a regression problem, built around the problem statement below. To solve it we will use the statistical, machine learning, and deep learning techniques we have learned so far in our studies.
Problem statement: Predict the sale prices of house sales in King County.
Further, we want to evaluate which of our models achieves the highest accuracy.
In this project we work with a dataset containing house prices for King County (the Seattle area), Washington, USA. It includes houses sold between May 2014 and May 2015 and consists of 21,613 observations of 21 variables.
#devtools::install_github("thomasp85/patchwork")
library(tidyverse)
library(keras)
library(caret)
library(patchwork)
library(knitr)
library(kableExtra)
library(ggmap)
library(tidymodels)
library(geosphere)
library(recipes)
library(ggcorrplot)
library(ggforce)
library(ggplot2)
library(ggthemes)
Given the variables in the dataset, we have chosen to derive some new variables from them, which we think will help the prediction. First we recode the variable “yr_renovated” into a dummy: 1 if the house has been renovated and 0 if not. Second, we code a “basement” dummy from the variable “sqft_basement”: 1 if the house has a basement and 0 if not. Third, we create a distance variable from the latitude and longitude in the dataset, measuring the distance between the center of Seattle and each house’s location. Finally, we divide the observations into 100 clusters based on latitude and longitude. The data now consists of 21,613 observations of 25 variables.
The final dataset for the models can be seen in the table below.
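The distance feature described above is a great-circle distance; the report computes it with `geosphere::distHaversine`. As a minimal, language-agnostic sketch of what that call does, here is the Haversine formula in Python. The Seattle reference point matches the one used in the code below, while the house coordinates are hypothetical:

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, r_km=6378.137):
    """Great-circle distance between two (lon, lat) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r_km * math.asin(math.sqrt(a))

seattle = (-122.332069, 47.606209)   # downtown Seattle reference point
house = (-122.257, 47.711)           # hypothetical house location
dist = haversine_km(house[0], house[1], seattle[0], seattle[1])
```

Applied row by row, this yields the `distance` column; the default radius of 6378.137 km corresponds to the `r = 6378137` metres passed to `distHaversine`.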
kc <- read_csv("https://raw.githubusercontent.com/LarsHernandez/M3-Project-HouseSalesKingCountry/master/kc_house_data.csv")
seattle <- c(-122.332069,47.606209)
coordinate <- as.matrix(data.frame(lon = kc$long, lat = kc$lat))
kc$distance <- round(distHaversine(coordinate, seattle, r = 6378137)/1000,2)
kc$cluster <- kmeans(cbind(kc$lat, kc$long), centers = 100, nstart = 10, iter.max = 50)$cluster
kc$renovated <- ifelse(kc$yr_renovated != 0,1,0)
kc$basement <- ifelse(kc$sqft_basement != 0,1,0)
head(kc)
## # A tibble: 6 x 25
## id date price bedrooms bathrooms sqft_living sqft_lot
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7129~ 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
## 2 6414~ 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
## 3 5631~ 2015-02-25 00:00:00 1.80e5 2 1 770 10000
## 4 2487~ 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
## 5 1954~ 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
## 6 7237~ 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
## # ... with 18 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
## # condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
## # yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
## # sqft_living15 <dbl>, sqft_lot15 <dbl>, distance <dbl>, cluster <int>,
## # renovated <dbl>, basement <dbl>
After preprocessing the data, we plot the distributions of some of the variables in the dataset.
range(kc$date)
## [1] "2014-05-02 UTC" "2015-05-27 UTC"
difftime(range(kc$date)[1],range(kc$date)[2])
## Time difference of -390 days
p1 <- kc %>% ggplot(aes(price)) + geom_histogram(bins=200) + labs(title = "Price distribution", y=NULL, x=NULL)
p2 <- kc %>% ggplot(aes(date)) + geom_histogram(bins=56) + labs(title = "Date of sale", y=NULL, x=NULL)
p3 <- kc %>% ggplot(aes(yr_built)) + geom_histogram(bins=115) + labs(title = "Date of construction", y=NULL, x=NULL)
p4 <- kc %>% ggplot(aes(grade)) + geom_bar() + labs(title = "House grade", y=NULL, x=NULL)
p5 <- kc %>% ggplot(aes(sqft_living)) + geom_histogram(bins=200) + labs(title = "House Squarefeet", y=NULL, x=NULL)
p6 <- kc %>% ggplot(aes(round(bathrooms))) + geom_bar() + labs(title = "House bathrooms", y=NULL, x=NULL)
p7 <- kc %>% ggplot(aes(round(condition))) + geom_bar() + labs(title = "House condition", y=NULL, x=NULL)
p8 <- kc %>% ggplot(aes(as.factor(waterfront)))+ geom_bar() + labs(title = "House waterfront", y=NULL, x=NULL)
p9 <- kc %>% ggplot(aes(round(view))) + geom_bar() + labs(title = "House view", y=NULL, x=NULL)
p2 + p3 + p1 + (p6 + p7) + (p8 + p9) + (p4 + p5)
As can be seen from the plots above, there are some patterns in the data. The number of house sales is fairly constant over the year, except in the winter months where it drops considerably. The second plot shows that construction tends to be somewhat cyclical: the number of new constructions decreases in periods of economic recession, which makes sense. The remaining plots show the distributions of price, bathrooms, condition, waterfront, view, grade and square feet of the houses sold. Most of the houses sold are quite similar, since every plot has a region of high density; only a few houses are quite unique compared to the rest.
Next we plot the mean price across eight of the variables to look for patterns in the data.
d1 <- kc %>% group_by(zipcode) %>% summarize(mean = mean(price)/1000) %>%
ggplot(aes(reorder(as.factor(zipcode),mean), mean)) + geom_col() + theme(axis.text.x = element_text(angle = 90)) +
labs(title="Price by zipcode", x=NULL, y=NULL)
d2 <- kc %>% group_by(view) %>% summarize(mean = mean(price)/1000) %>%
ggplot(aes(reorder(as.factor(view),mean), mean)) + geom_col() +
labs(title="Price by view", x=NULL, y=NULL)
d3 <- kc %>% group_by(bedrooms) %>% summarize(mean = mean(price)/1000) %>%
ggplot(aes(reorder(as.factor(bedrooms),mean), mean)) + geom_col() +
labs(title="Price by number of bedrooms", x=NULL, y=NULL)
d4 <- kc %>% group_by(waterfront) %>% summarize(mean = mean(price)/1000) %>%
ggplot(aes(reorder(as.factor(waterfront),mean), mean)) + geom_col() +
labs(title="Price by waterfront", x=NULL, y=NULL)
d5 <- kc %>% group_by(condition) %>% summarize(mean = mean(price)/1000) %>%
ggplot(aes(reorder(as.factor(condition),mean), mean)) + geom_col() +
labs(title="Price by condition", x=NULL, y=NULL)
d6 <- kc %>% group_by(grade) %>% summarize(mean = mean(price)/1000) %>%
ggplot(aes(reorder(as.factor(grade),mean), mean)) + geom_col() +
labs(title="Price by grade", x=NULL, y=NULL)
d7 <- kc %>% group_by(yr_built) %>% summarize(mean = mean(price)/1000) %>%
ggplot(aes(as.numeric(yr_built), mean)) + geom_area() +
labs(title="Price by year built", x=NULL, y=NULL)
d8 <- kc %>% group_by(date) %>% summarize(mean = mean(price)/1000) %>%
ggplot(aes(date, mean)) + geom_point() + geom_smooth() +
labs(title="Price by date of sale", x=NULL, y=NULL)
(d1 / d3 / d7)
(d4 | d5 | d2) / (d6 | d8)
As can be seen, there are some clear patterns in the data: the price of a house increases if it has a waterfront, or the higher its grade, view, condition or number of bedrooms, all of which makes sense. There does not seem to be a pattern in the year the house was built; the price is roughly the same regardless of construction year.
lat <- range(kc$lat)
lon <- range(kc$long)
coord <- as.matrix(data.frame(min = c(lon[1]-0.25, lat[1]-0.05),
max = c(lon[2]+0.05, lat[2]+0.05),
row.names = c("x","y")))
map13 <- get_stamenmap(coord, zoom = 10, maptype = "toner-lite", force=TRUE)
high <- "#084081"
low <- "#252525"
ggmap(map13) +
labs(title="Map of King County", subtitle="Empty map from stamen-map", x=NULL, y=NULL)
ggmap(map13) +
stat_density_2d(data=kc, aes(long,lat, fill=..level..), geom = "polygon", alpha = .7) +
scale_fill_viridis_c(option="cividis") +
labs(title="Map of King County - density", subtitle = "21,613 housing sales in 2014 - 2015", fill="Density of\nsales", x=NULL, y=NULL)
ggmap(map13) +
geom_point(data = kc, aes(long,lat, color=price/1000, size=price/1000), alpha = 0.5) +
labs(title="Map of King County - price", subtitle = "The mean price of all sales is 540,088 and the median is 450,000", x=NULL, y=NULL) +
guides(color= guide_legend(), size=guide_legend()) +
scale_size(range = c(0, 3)) +
scale_color_viridis_c(option="cividis")
s1 <- ggmap(map13) +
geom_point(data = kc, aes(long,lat, fill=as.factor(zipcode)), alpha = 0.1, size=3, show.legend = F, color="black", shape=21) +
labs(title="Map of King County - zipcodes", subtitle = "Sales grouped by zipcodes", x=NULL, y=NULL)
s2 <- ggmap(map13) +
geom_point(data = kc, aes(long,lat, fill=as.factor(cluster)), alpha = 0.1, size=3, show.legend = F, color="black", shape=21) +
labs(title="Map of King County - clusters", subtitle = "Sales grouped by kmeans clusters", x=NULL, y=NULL)
s1 + s2
The second plot shows the density of sales: the density is higher where more sales took place. The third plot shows the house sales in King County, where the color indicates the sale price. The last plot shows sales grouped by clusters.
Next we split the sale prices into 15 intervals on the log scale and calculate the mean price for each. Afterwards the sales density of each interval is plotted on the map of King County.
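The split is done with `cut()` on the log of the price, so the 15 intervals are equal-width on the log scale, i.e. they cover equal price ratios rather than equal price ranges. A minimal sketch of the same binning in Python, using hypothetical prices rather than the actual dataset:

```python
import math

def log_bin(prices, n_bins=15):
    """Assign each price to one of n_bins equal-width bins on the log scale."""
    logs = [math.log(p) for p in prices]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / n_bins
    # Clamp the maximum into the last bin, mirroring cut()'s closed upper edge
    return [min(int((x - lo) / width), n_bins - 1) for x in logs]

prices = [75_000, 150_000, 450_000, 540_000, 1_200_000, 7_700_000]
bins = log_bin(prices, n_bins=15)
```

Because the bins are equal-width in log space, a doubling of price always spans the same number of bins, which keeps the heavily right-skewed price distribution from dumping almost all sales into the first interval.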
coord <- as.matrix(data.frame(min = c(lon[1], lat[1]),
max = c(lon[2]-0.5, lat[2]),
row.names = c("x","y")))
map13 <- get_stamenmap(coord, zoom = 10, maptype = "toner-lite", force=TRUE)
kcc <- kc
kcc$cut <- cut(log(kcc$price), 15)
kcc <- kcc %>% group_by(cut) %>% mutate(mean = round(mean(price),-3))
ggmap(map13) +
stat_density_2d(data=kcc, aes(long,lat, fill=..level..), geom = "polygon", alpha = .3, show.legend = F) +
geom_point(data=kcc, aes(long,lat), alpha=0.1, size=0.5) +
scale_fill_viridis_c(option="cividis", end = 0.7) +
facet_wrap(~mean, nrow=3) +
labs(title="Price cut into 15 chunks", x=NULL, y=NULL)
As can be seen, the density falls as the mean price increases.
Before we start building our models, we check the correlations between the variables to see whether we should remove some of them.
dfcor <- kc %>% select(-c(id,date))
dfcor<-round(cor(dfcor), 2)
dfcor %>% ggcorrplot(hc.order = TRUE,
lab_size = 3,
outline.color = "black",
type = "lower",
lab = TRUE,
ggtheme = ggplot2::theme_gray,
colors = c("#6D9EC1", "white", "#E46726")) +
labs(title="Correlation between variables")
From the plot above it can be seen that there are a few high correlations in the dataset, e.g. between “sqft_living” and “sqft_above”, which indicates that they contain much of the same information. This is the case for almost all of the high correlations, both negative and positive.
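Each cell of the heatmap is a pairwise Pearson correlation. As a reminder of what that measures, here is a minimal Pearson sketch in Python; the square-footage values are hypothetical, not rows from the dataset:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

living = [1180, 2570, 770, 1960, 1680]  # hypothetical sqft_living values
above  = [1180, 2170, 770, 1050, 1680]  # hypothetical sqft_above values
r = pearson(living, above)
```

A coefficient near 1 (or -1), like the sqft_living/sqft_above cell in the plot, means the two columns carry largely redundant information for a linear model.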
kc %>% filter(bedrooms!=33) %>% select(price, grade, sqft_living, sqft_lot, bedrooms, bathrooms, sqft_above, sqft_basement, sqft_living15, sqft_lot15) %>%
gather(variable, value, -price) %>%
ggplot(aes(value, price/1000, group=0)) + geom_point() + geom_smooth() + facet_wrap(~variable, scales="free") +
labs(title="Correlation between price and other variables", x=NULL, y=NULL)
kc %>% filter(bedrooms!=33) %>% select(price, view, waterfront, floors, condition) %>%
gather(variable, value, -price) %>%
ggplot(aes(value, price/1000, group=0)) + geom_point() + geom_smooth(method="lm") +
facet_wrap(~variable, scales="free", nrow=1) +
labs(title="Correlation between price and other variables", x=NULL, y=NULL)
The dataset is divided into training and test sets, the categorical variables are turned into dummy variables, and the remaining numeric variables are normalized.
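The normalization here is min-max scaling (`step_range` in the recipe below): each numeric predictor is mapped to [0, 1] using the minimum and maximum learned from the training data only. A minimal sketch in Python with hypothetical training values; note that test-set values scaled with the training min/max can fall outside [0, 1] (or be clipped, depending on the implementation):

```python
def min_max_scale(train, values):
    """Rescale `values` using the training min/max, mapping train min -> 0 and train max -> 1."""
    lo, hi = min(train), max(train)
    return [(v - lo) / (hi - lo) for v in values]

train_sqft = [770, 1180, 1680, 2570, 5420]  # hypothetical sqft_living training values
scaled = min_max_scale(train_sqft, train_sqft)
```

Learning the range from the training set alone, and then applying it to the test set, prevents information from the test set leaking into the preprocessing.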
skc <- kc %>% select(-id, -date, -yr_renovated, -sqft_basement) %>%
filter(bedrooms != 33)
index <- createDataPartition(skc$price, p = 0.75, list = FALSE)
training <- skc[index,]
test <- skc[-index,]
training2 <- skc[index,]
test2 <- skc[-index,]
paste0("The training set has ", dim(training)[1],
" and the test set has ", dim(test)[1], " observations")
## [1] "The training set has 16211 and the test set has 5401 observations"
model_recipe <- recipe(price ~ ., data = training)
model_recipe_steps <- model_recipe %>%
step_num2factor(yr_built, view, waterfront, floors, cluster, zipcode, basement, renovated) %>%
step_dummy(yr_built, view, waterfront, floors, cluster, zipcode, basement, renovated) %>%
step_range(bedrooms, bathrooms, sqft_living, sqft_lot, condition, grade, sqft_above,
sqft_living15, sqft_lot15, distance, min = 0, max = 1)
prepped_recipe <- prep(model_recipe_steps, training = training)
training <- bake(prepped_recipe, training)
test <- bake(prepped_recipe, test)
test[is.na(test)] <- 0
x_train <- training %>% select(-price)
x_test <- test %>% select(-price)
y_train <- training %>% select(price)
y_test <- test %>% select(price)
Several models are fitted to compare their performance with the neural network regression:
Linear model
LARS (Least Angle Regression)
Elastic net
Recursive partitioning and regression trees
Random forest
XGB (Extreme Gradient Boosting)
KNN (weighted k-nearest neighbors)
Bagging
Cubist regression
cv <- trainControl(method = "cv", number = 5)
t0 <- Sys.time()
fit_glm <- train(price ~ .,
data = training,
trControl = cv,
method = "lm",
metric = "RMSE")
t1 <- Sys.time()
fit_lar <- train(price ~ .,
data = training,
trControl = cv,
method = 'lars',
tuneGrid = expand.grid(.fraction = seq(0.01, 0.99, length = 50)),
metric = "RMSE")
t2 <- Sys.time()
fit_ela <- train(price ~ .,
data = training,
trControl = cv,
tuneGrid = expand.grid(alpha = seq(0, 1, by = 0.1),
lambda = seq(1, 1000, by = 100)),
method = "glmnet",
family = "gaussian",
metric = "RMSE")
t3 <- Sys.time()
fit_tre <- train(price ~ .,
data = training,
trControl = cv,
method = "rpart",
tuneGrid = expand.grid(cp = 10^seq(1, -5, by = -1)),
metric = 'RMSE')
t4 <- Sys.time()
fit_raf <- train(price ~ .,
data = training,
trControl = cv,
tuneGrid = expand.grid(mtry = 4),
ntree = 25,
method = 'rf',
metric = 'RMSE')
t5 <- Sys.time()
model <- keras_model_sequential() %>%
layer_dense(input_shape = dim(x_train)[2], units = 128, activation = "relu") %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dropout(0.3) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(1)
model %>% compile(
loss = "mae",
optimizer = "adam",
metrics = list("mean_absolute_error"))
model %>% fit(
as.matrix(x_train),
as.matrix(y_train),
epochs = 100,
batch_size = 32,
validation_split = 0.1)
pred <- predict(model, x = as.matrix(x_test))
t6 <- Sys.time()
fit_xgb <- train(price ~ .,
data = training,
trControl = cv,
method = "xgbTree",
metric = "RMSE")
t7 <- Sys.time()
fit_knn <- train(price ~ .,
data = training[1:3000,],
trControl = cv,
method = "kknn",
metric = "RMSE")
t8 <- Sys.time()
fit_bag <- train(price ~ .,
data = training,
method = "treebag",
nbagg = 20,
metric = "RMSE",
trControl = cv)
t9 <- Sys.time()
fit_cub <- train(price ~ .,
data = training,
trControl = cv,
method = "cubist",
metric = "RMSE")
t10 <- Sys.time()
fit_sim <- train(price ~ bedrooms*bathrooms + sqft_living + sqft_lot + renovated + log(grade) +
view*grade*waterfront + grade*condition + log(sqft_living15) + view + cluster + distance +
sqft_lot15 + sqft_lot15^2 + lat*long*zipcode,
data = training2,
trControl = cv,
method = "lm",
metric = "RMSE")
summary(fit_sim)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1549127 -88545 -12292 71575 3515605
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.308e+10 4.102e+09 12.942 < 2e-16 ***
## bedrooms -6.047e+04 4.159e+03 -14.540 < 2e-16 ***
## bathrooms -5.881e+04 6.595e+03 -8.916 < 2e-16 ***
## sqft_living 1.468e+02 3.476e+00 42.228 < 2e-16 ***
## sqft_lot 2.039e-01 4.866e-02 4.191 2.80e-05 ***
## renovated 9.000e+04 6.989e+03 12.877 < 2e-16 ***
## `log(grade)` -2.677e+06 8.854e+04 -30.242 < 2e-16 ***
## view -8.273e+04 1.301e+04 -6.359 2.09e-10 ***
## grade 3.391e+05 1.255e+04 27.013 < 2e-16 ***
## waterfront 7.351e+05 5.674e+05 1.296 0.19513
## condition -1.337e+05 1.493e+04 -8.954 < 2e-16 ***
## `log(sqft_living15)` 5.762e+04 7.045e+03 8.180 3.05e-16 ***
## cluster 3.109e+02 5.013e+01 6.201 5.74e-10 ***
## distance -1.226e+04 2.726e+02 -44.979 < 2e-16 ***
## sqft_lot15 2.597e-02 7.311e-02 0.355 0.72240
## lat -2.448e+08 2.120e+07 -11.549 < 2e-16 ***
## long 3.818e+08 3.238e+07 11.789 < 2e-16 ***
## zipcode -4.870e+05 4.331e+04 -11.246 < 2e-16 ***
## `bedrooms:bathrooms` 2.049e+04 1.688e+03 12.137 < 2e-16 ***
## `view:grade` 1.482e+04 1.499e+03 9.892 < 2e-16 ***
## `view:waterfront` -3.432e+05 1.496e+05 -2.294 0.02181 *
## `grade:waterfront` -7.817e+04 7.183e+04 -1.088 0.27649
## `grade:condition` 2.431e+04 2.005e+03 12.123 < 2e-16 ***
## `lat:long` -9.142e+05 9.201e+04 -9.936 < 2e-16 ***
## `lat:zipcode` 1.359e+03 2.447e+02 5.554 2.84e-08 ***
## `long:zipcode` -3.448e+03 3.369e+02 -10.234 < 2e-16 ***
## `view:grade:waterfront` 5.580e+04 1.877e+04 2.973 0.00296 **
## `lat:long:zipcode` NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 176900 on 16184 degrees of freedom
## Multiple R-squared: 0.7587, Adjusted R-squared: 0.7583
## F-statistic: 1957 on 26 and 16184 DF, p-value: < 2.2e-16
t11 <- Sys.time()
model2 <- keras_model_sequential() %>%
layer_dense(units = 307, activation = "relu", input_shape = dim(x_train)[2]) %>%
layer_dropout(rate=0.2) %>%
layer_dense(units = 256, activation = "relu") %>%
layer_dense(units = 256, activation = "relu") %>%
layer_dense(units = 256, activation = "relu") %>%
layer_dropout(rate=0.2) %>%
layer_dense(units = 1)
model2 %>% compile(
loss = "mse",
optimizer = "adam",
metrics = list("mean_absolute_error"))
model2 %>% keras::fit(as.matrix(x_train), as.matrix(y_train),
epochs = 100,
validation_split = 0.2,
batch_size = 32,
verbose = 1)
pred2 <- predict(model2, x = as.matrix(x_test))
t12 <- Sys.time()
x_train <- training2 %>% select(-price)
y_train <- training2 %>% select(price)
x_test <- test2 %>% select(-price)
y_test <- test2 %>% select(price)
zip_train <- training2$zipcode %>% as.character() %>% as.numeric()
zip_test <- test2$zipcode %>% as.character() %>% as.numeric()
zipp <- tibble(z = zip_train) %>% mutate(z=as.factor(z)) %>% mutate(z=as.numeric(z))
zip_train <- zipp$z
main_input <- layer_input(shape = c(1), dtype = 'int32', name = 'main_input')
lstm_out <- main_input %>%
layer_embedding(input_dim = 200, output_dim = 5, input_length = 1) %>%
layer_flatten()
auxiliary_input <- layer_input(shape = c(dim(x_train)[2]), name = 'aux_input')
main_output <- layer_concatenate(c(lstm_out, auxiliary_input)) %>%
layer_dense(units = 256, activation = 'relu') %>%
layer_dropout(0.3) %>%
layer_dense(units = 256, activation = 'relu') %>%
layer_dropout(0.2) %>%
layer_dense(units = 1)
model <- keras_model(
inputs = c(main_input, auxiliary_input),
outputs = c(main_output)
)
model %>% compile(
loss = "mse",
optimizer = "adam",
metrics = "mean_absolute_error")
model %>% fit(
x = list(as.matrix(zip_train), as.matrix(x_train)),
y = as.matrix(y_train),
epochs = 100,
validation_split = 0.1,
batch_size = 20)
zipp2 <- tibble(z = zip_test) %>% mutate(z=as.factor(z)) %>% mutate(z=as.numeric(z))
zip_test <- zipp2$z
pred3 <- predict(model, x = list(as.matrix(zip_test),as.matrix(x_test)))
t13 <- Sys.time()
Time to run all models:
cat(paste0("Size of training set: ", nrow(training)," of 21613\n","\n",
"LM: ", round(difftime(t1,t0, units = "mins"),2),"\n",
"LARS: ", round(difftime(t2,t1, units = "mins"),2),"\n",
"Elastic: ", round(difftime(t3,t2, units = "mins"),2),"\n",
"Tree: ", round(difftime(t4,t3, units = "mins"),2),"\n",
"Random Forest: ", round(difftime(t5,t4, units = "mins"),2),"\n",
"Neural Net: ", round(difftime(t6,t5, units = "mins"),2),"\n",
"XGB: ", round(difftime(t7,t6, units = "mins"),2),"\n",
"knn: ", round(difftime(t8,t7, units = "mins"),2),"\n",
"Bagged: ", round(difftime(t9,t8, units = "mins"),2),"\n",
"Cubist: ", round(difftime(t10,t9, units = "mins"),2),"\n",
"Manual LM: ", round(difftime(t11,t10, units = "mins"),2),"\n",
"Alt Neural Net: ", round(difftime(t12,t11, units = "mins"),2),"\n",
"Emb Neural Net: ", round(difftime(t13,t12, units = "mins"),2),"\n",
"Total: ", round(difftime(t13,t0, units = "mins"),2),"\n"))
## Size of training set: 16211 of 21613
##
## LM: 0.24
## LARS: 0.58
## Elastic: 0.91
## Tree: 0.3
## Random Forest: 29.95
## Neural Net: 4.07
## XGB: 17.49
## knn: 1.49
## Bagged: 1.79
## Cubist: 29.57
## Manual LM: 0.02
## Alt Neural Net: 6.11
## Emb Neural Net: 6.87
## Total: 99.4
The mean absolute error is calculated for all the above models.
RMSE <- function(m, o){
  sqrt(mean((m - o)^2))
}
MAE <- function(m, o){
  mean(abs(m - o))
}
mae_nnn <- MAE(pred2, test$price)
mae_nnt <- MAE(pred, test$price)
mae_nne <- MAE(pred3, test2$price)
mae_ran <- MAE(rep(mean(training$price), nrow(test)), test$price)
mae_glm <- MAE(predict(fit_glm, newdata=test) %>% as.vector, test$price)
mae_lar <- MAE(predict(fit_lar, newdata=test) %>% as.vector, test$price)
mae_ela <- MAE(predict(fit_ela, newdata=test) %>% as.vector, test$price)
mae_tre <- MAE(predict(fit_tre, newdata=test) %>% as.vector, test$price)
mae_raf <- MAE(predict(fit_raf, newdata=test) %>% as.vector, test$price)
mae_xgb <- MAE(predict(fit_xgb, newdata=test) %>% as.vector, test$price)
mae_knn <- MAE(predict(fit_knn, newdata=test) %>% as.vector, test$price)
mae_bag <- MAE(predict(fit_bag, newdata=test) %>% as.vector, test$price)
mae_cub <- MAE(predict(fit_cub, newdata=test) %>% as.vector, test$price)
mae_sim <- MAE(predict(fit_sim, newdata=test2) %>% as.vector, test2$price)
df_mae <- cbind(mae_glm, mae_lar, mae_ela, mae_tre, mae_raf, mae_nnt, mae_ran, mae_xgb, mae_knn, mae_bag, mae_cub,mae_sim, mae_nnn, mae_nne)
df_mae %>% kable() %>%
kable_styling("bordered", "condensed")
| mae_glm | mae_lar | mae_ela | mae_tre | mae_raf | mae_nnt | mae_ran | mae_xgb | mae_knn | mae_bag | mae_cub | mae_sim | mae_nnn | mae_nne |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 94044.64 | 94038.69 | 94137.38 | 87884.99 | 70626.69 | 70253 | 238417.9 | 71911.16 | 146441.5 | 121587.1 | 64941.07 | 114819.2 | 88311.33 | 126277.1 |
df_mae %>%
as_tibble %>%
gather(variable, mae) %>%
mutate(Model = c("LinearModel", "LARS","ElasticNet", "RegressionTree", "RandomForest",
"NeuralNet","MeanModel", "XGB", "KNN","BaggedTree","Cubist","ManualLM", "AltNeuralNet","EmbeddedNeuralNet")) %>%
ggplot(aes(reorder(Model, desc(mae)), mae/1000)) +
geom_col() +
geom_text(aes(y=(mae/1000)-14, label=format(round(mae/1000,1), nsmall = 1)), color="white") +
labs(title="Model performance Mean Absolute Error", subtitle="All models are run with same train/test split",
y="MAE", x=NULL) +
coord_flip()
The neural network regression is unable to obtain the lowest error: the Cubist regression performs noticeably better, although the plain neural network still beats the random forest, XGBoost and the remaining models on this metric.
Below the root mean squared error is calculated. The ranking is broadly similar: the Cubist model again has the lowest error, with the plain neural network second.
rmse_nnt <- RMSE(pred, test$price)
rmse_nnn <- RMSE(pred2, test$price)
rmse_nne <- RMSE(pred3, test2$price)
rmse_ran <- RMSE(rep(mean(training$price), nrow(test)), test$price)
rmse_glm <- RMSE(predict(fit_glm, newdata=test) %>% as.vector, test$price)
rmse_lar <- RMSE(predict(fit_lar, newdata=test) %>% as.vector, test$price)
rmse_ela <- RMSE(predict(fit_ela, newdata=test) %>% as.vector, test$price)
rmse_tre <- RMSE(predict(fit_tre, newdata=test) %>% as.vector, test$price)
rmse_raf <- RMSE(predict(fit_raf, newdata=test) %>% as.vector, test$price)
rmse_xgb <- RMSE(predict(fit_xgb, newdata=test) %>% as.vector, test$price)
rmse_knn <- RMSE(predict(fit_knn, newdata=test) %>% as.vector, test$price)
rmse_bag <- RMSE(predict(fit_bag, newdata=test) %>% as.vector, test$price)
rmse_cub <- RMSE(predict(fit_cub, newdata=test) %>% as.vector, test$price)
rmse_sim <- RMSE(predict(fit_sim, newdata=test2) %>% as.vector, test2$price)
df_rmse <- cbind(rmse_glm, rmse_lar, rmse_ela, rmse_tre, rmse_raf, rmse_nnt, rmse_ran, rmse_xgb, rmse_knn, rmse_bag, rmse_cub, rmse_sim, rmse_nnn, rmse_nne)
df_rmse %>% kable() %>%
kable_styling("bordered", "condensed")
| rmse_glm | rmse_lar | rmse_ela | rmse_tre | rmse_raf | rmse_nnt | rmse_ran | rmse_xgb | rmse_knn | rmse_bag | rmse_cub | rmse_sim | rmse_nnn | rmse_nne |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 168666.6 | 168654.8 | 168592.5 | 170120.5 | 139898.8 | 129426 | 388201 | 133206.7 | 286529 | 203099.9 | 122179.8 | 192299.1 | 144826.6 | 208679.2 |
df_rmse %>%
as_tibble %>%
gather(variable, rmse) %>%
mutate(Model = c("LinearModel", "LARS","ElasticNet", "RegressionTree", "RandomForest",
"NeuralNet","MeanModel", "XGB", "KNN","BaggedTree","Cubist","ManualLM","AltNeuralNet","EmbeddedNeuralNet")) %>%
ggplot(aes(reorder(Model, desc(rmse)), rmse/1000)) +
geom_col() +
geom_text(aes(y=(rmse/1000)-16, label=format(round(rmse/1000,1), nsmall = 1)), color="white") +
labs(title="Model performance Root Mean Squared Error", subtitle="All models are run with same train/test split",
y="RMSE", x=NULL) +
coord_flip()
We can conclude that, for the given problem statement, the Cubist model best predicts the sale prices of house sales in King County: it has the lowest score on both root mean squared error (RMSE) and mean absolute error (MAE). As the included models show, there is always room for more fine-tuning of a chosen model, which will of course improve the predictions. This is not surprising, since the more complex a model gets, the better it will typically score, which matches our findings. The craftsmanship lies in identifying the right model for the specific task at hand.